From 167GB OOM to Real-Time on a GTX 1060: Building a Hairstyle AI

By Dubido | January 18, 2026

Building a generative AI application is easy when you have a cluster of H100s. The real engineering starts when you have a single NVIDIA GTX 1060 (6GB VRAM) and a user who just uploaded a 4K photo.

This post documents our journey building a Virtual Hairstyle Try-On app. We started with the bleeding-edge FLUX.1 model and ended up with a highly optimized Stable Diffusion 1.5 pipeline. Here are the pitfalls we hit and the lessons we learned.


The Ambition: "FLUX or Nothing"

Our initial goal was simple: use the latest SOTA model, FLUX.1-Fill-dev, for high-fidelity inpainting. The architecture seemed straightforward:

  1. Frontend: React + Vite (camera/upload)
  2. Backend: FastAPI
  3. Preprocessing: SegFormer for hair masking
  4. Inference: FLUX.1 for generation

Pitfall #1: The "Gated Repo" (403 Forbidden)

The first wall we hit wasn't technical; it was bureaucratic.

Error loading FLUX model: 403 Client Error... Access to model black-forest-labs/FLUX.1-Fill-dev is restricted.

The Fix: FLUX is a gated model. We had to go to Hugging Face, accept the license, generate a read token, and inject it into the backend via python-dotenv.
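A minimal sketch of the token plumbing we ended up with. The `HF_TOKEN` variable name and the `hf_auth_kwargs` helper are our own conventions, not anything diffusers mandates; the point is to fail loudly at startup instead of getting a cryptic 403 mid-download.

```python
import os

def hf_auth_kwargs(env_var: str = "HF_TOKEN") -> dict:
    """Build the auth keyword for `from_pretrained` calls on gated repos.

    In the app, python-dotenv populates the environment from a .env file
    before this runs. Raising early turns a missing token into a clear
    config error rather than a 403 from Hugging Face.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; accept the model license on Hugging Face "
            "and export a read token"
        )
    return {"token": token}

# Usage (sketch):
# pipe = FluxFillPipeline.from_pretrained(
#     "black-forest-labs/FLUX.1-Fill-dev", **hf_auth_kwargs()
# )
```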

Pitfall #2: The 12-Billion Parameter Elephant

Once authenticated, we hit the physical limits of the GTX 1060. FLUX's transformer alone has ~12 billion parameters:

  * Attempt 1: Standard loading (bfloat16). Result: immediate OOM (out of memory).
  * Attempt 2: 4-bit quantization (bitsandbytes NF4). Result: still exhausted system RAM during loading.
  * Attempt 3: enable_sequential_cpu_offload(). Result: it ran, but at a glacial pace.
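For reference, the attempts above roughly correspond to the following diffusers incantations. This is a sketch, not our exact code: it assumes diffusers with Flux support plus bitsandbytes installed, and the imports are deliberately lazy so the module parses on machines without them.

```python
def load_flux_fill(repo: str = "black-forest-labs/FLUX.1-Fill-dev",
                   quantize_4bit: bool = True,
                   cpu_offload: bool = True):
    """Sketch of the FLUX loading attempts (assumes diffusers + bitsandbytes)."""
    import torch
    from diffusers import FluxFillPipeline, FluxTransformer2DModel

    transformer = None
    if quantize_4bit:
        # Attempt 2: NF4-quantize the 12B transformer before assembly.
        from diffusers import BitsAndBytesConfig
        quant = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        transformer = FluxTransformer2DModel.from_pretrained(
            repo, subfolder="transformer",
            quantization_config=quant, torch_dtype=torch.bfloat16,
        )

    # Attempt 1: without the quantized transformer, this is plain bf16 loading.
    pipe = FluxFillPipeline.from_pretrained(
        repo, transformer=transformer, torch_dtype=torch.bfloat16,
    )
    if cpu_offload:
        # Attempt 3: stream weights layer-by-layer from RAM. It runs. Slowly.
        pipe.enable_sequential_cpu_offload()
    return pipe
```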

Lesson: SOTA models are great for research, but for a responsive app on consumer hardware, model size matters more than raw generation quality.


The Pivot: Embracing Stable Diffusion 1.5

We decided to downgrade the engine to Stable Diffusion 1.5 Inpainting. It's older, but it's efficient, robust, and designed for 512x512 resolution—perfect for a 6GB card.
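The replacement loader is pleasantly boring by comparison. A sketch under assumptions: the repo id shown is the commonly mirrored SD 1.5 inpainting checkpoint (substitute whichever one you host), and imports are lazy so the file parses without torch/diffusers installed.

```python
def load_sd15_inpaint(device: str = "cuda"):
    """Load the SD 1.5 inpainting pipeline in fp16 for a 6GB card.

    Repo id is an assumption; point it at your own mirror if needed.
    """
    import torch
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-inpainting",
        torch_dtype=torch.float16,
    )
    # Trades a little speed for a large VRAM reduction on small GPUs.
    pipe.enable_attention_slicing()
    return pipe.to(device)
```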

Pitfall #3: The "No Kernel Image" (Architecture Mismatch)

We installed the latest PyTorch, only to be greeted by this cryptic error:

torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device

The Root Cause: the newest PyTorch wheels no longer ship precompiled kernels for the Pascal architecture (compute capability sm_61), which is exactly what the GTX 1060 is.

The Fix: We downgraded PyTorch to a compatible version (v2.5.1/v2.6.0) that supports older CUDA compute capabilities.
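To fail fast instead of crashing mid-inference, we can check at startup whether the installed wheel actually ships kernels for the GPU. `has_kernel_for` is a hypothetical helper of ours, but `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are the real PyTorch APIs it consumes.

```python
def has_kernel_for(capability: tuple, arch_list: list) -> bool:
    """True if a PyTorch build ships CUDA kernels for this GPU.

    `capability` is e.g. (6, 1) from torch.cuda.get_device_capability();
    `arch_list` is e.g. ["sm_61", "sm_70"] from torch.cuda.get_arch_list().
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list

# At app startup (guarded so CPU-only machines don't crash):
# import torch
# if torch.cuda.is_available():
#     ok = has_kernel_for(torch.cuda.get_device_capability(0),
#                         torch.cuda.get_arch_list())
#     if not ok:
#         raise SystemExit("This PyTorch build has no kernels for your GPU; "
#                          "install a wheel that still targets sm_61.")
```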

Pitfall #4: The safetensors Trap

In an attempt to be secure, we enforced use_safetensors=True (safetensors weights can't execute code on load, unlike pickled .bin files).

OSError: Could not find the necessary `safetensors` weights...

The Fix: The SD 1.5 repo only ships legacy pickled .bin weights. We had to drop the use_safetensors=True constraint so diffusers could fall back to them.
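One way to keep the safety preference without hard-coding it: inspect the repo's file listing (e.g. via `huggingface_hub.list_repo_files`) and only enforce safetensors when they actually exist. `weights_variant` is a hypothetical helper sketching that decision, not a diffusers API.

```python
def weights_variant(repo_files: list) -> str:
    """Decide which weight format to request from diffusers.

    SD 1.5-era repos often ship only pickled .bin weights; forcing
    use_safetensors=True on those raises OSError, so prefer safetensors
    only when the repo actually contains them.
    """
    if any(f.endswith(".safetensors") for f in repo_files):
        return "safetensors"
    if any(f.endswith(".bin") for f in repo_files):
        return "bin"
    raise ValueError("no recognizable model weights in repo listing")
```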


The Final Boss: The "167GB" Error

Everything was running. Then, we tested it with a 4K photo.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.97 GiB.

The Physics of Attention: self-attention memory is $O(N^2)$ in the token count. A 4K image has ~36x more pixels than a 512x512 one, so the attention matrix needs roughly $36^2 \approx 1300$x more memory, hence the absurd 167GB allocation request.
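You can sanity-check the blow-up with a few lines of arithmetic. This sketch assumes SD-style attention over the VAE's 8x-downsampled latent (the downscale factor cancels out of the ratio anyway); for a 3840x2160 frame the ratio lands around 1000x, the same order of magnitude as the figure above.

```python
def attn_token_count(width: int, height: int, downscale: int = 8) -> int:
    """Tokens seen by self-attention: one per latent pixel (VAE downsamples 8x)."""
    return (width // downscale) * (height // downscale)

def attn_memory_ratio(big: tuple, small: tuple) -> float:
    """How much more memory a naive N x N attention matrix needs for `big` vs `small`."""
    n_big = attn_token_count(*big)
    n_small = attn_token_count(*small)
    return (n_big / n_small) ** 2

# attn_token_count(512, 512) -> 4096 tokens
# attn_memory_ratio((3840, 2160), (512, 512)) -> roughly 1000x
```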

The Solution: Preprocessing is King

We implemented a strict "Gatekeeper" in the backend to resize the longest edge to 512px while maintaining aspect ratio. The results:

  1. Reduced VRAM usage from 167GB -> ~4GB.
  2. Kept face geometry correct.
  3. Achieved generation times of ~5-8 seconds on the GTX 1060.
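The gatekeeper's sizing logic can be sketched as below. `clamp_longest_edge` is our helper name, not a library function; it also snaps both sides down to a multiple of 8, since SD 1.5's UNet expects dimensions divisible by 8.

```python
def clamp_longest_edge(width: int, height: int, max_edge: int = 512) -> tuple:
    """Gatekeeper resize target: longest edge <= max_edge, aspect ratio kept,
    both sides snapped down to a multiple of 8 (the UNet's stride)."""
    scale = min(1.0, max_edge / max(width, height))
    new_w = max(8, round(width * scale) // 8 * 8)
    new_h = max(8, round(height * scale) // 8 * 8)
    return new_w, new_h

# With Pillow, applied before the image ever reaches the pipeline:
# img = img.resize(clamp_longest_edge(*img.size), Image.LANCZOS)
```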


Conclusion

  1. Know your hardware: A GTX 1060 cannot run FLUX in real-time.
  2. Quantization helps, but architecture wins.
  3. Sanitize Inputs: Never pass raw user input (like 4K images) directly to a neural network.

Happy hacking!